[SDAG] Avoid creating redundant stack slots when lowering FSINCOS #108401

MacDue · 2024-09-12T14:42:24Z

When lowering FSINCOS to a library call (that takes output pointers) we can avoid creating new stack allocations if the results of the FSINCOS are being stored. Instead, we can take the destination pointers from the stores and pass those to the library call.

Note: As a NFC this also adds (and uses) RTLIB::getFSINCOS().

llvmbot · 2024-09-12T14:42:58Z

@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)

Changes

When lowering FSINCOS to a library call (that takes output pointers) we can avoid creating new stack allocations if the results of the FSINCOS are being stored. Instead, we can take the destination pointers from the stores and pass those to the library call.

Note: As a NFC this also adds RTLIB::getFSINCOS() and uses ExpandLibCall when lowering FSINCOS.

Full diff: https://github.com/llvm/llvm-project/pull/108401.diff

4 Files Affected:

(modified) llvm/include/llvm/CodeGen/RuntimeLibcallUtil.h (+4)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp (+48-51)
(modified) llvm/lib/CodeGen/TargetLoweringBase.cpp (+5)
(added) llvm/test/CodeGen/AArch64/sincos-stack-slots.ll (+115)

diff --git a/llvm/include/llvm/CodeGen/RuntimeLibcallUtil.h b/llvm/include/llvm/CodeGen/RuntimeLibcallUtil.h
index 7a131645893921..045ec7d3653119 100644
--- a/llvm/include/llvm/CodeGen/RuntimeLibcallUtil.h
+++ b/llvm/include/llvm/CodeGen/RuntimeLibcallUtil.h
@@ -62,6 +62,10 @@ Libcall getLDEXP(EVT RetVT);
 /// UNKNOWN_LIBCALL if there is none.
 Libcall getFREXP(EVT RetVT);
 
+/// getFSINCOS - Return the FSINCOS_* value for the given types, or
+/// UNKNOWN_LIBCALL if there is none.
+Libcall getFSINCOS(EVT RetVT);
+
 /// Return the SYNC_FETCH_AND_* value for the given opcode and type, or
 /// UNKNOWN_LIBCALL if there is none.
 Libcall getSYNC(unsigned Opc, MVT VT);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
index f5fbc01cd95e96..7b0dc63c473d25 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
@@ -2326,15 +2326,7 @@ SelectionDAGLegalize::ExpandDivRemLibCall(SDNode *Node,
 
 /// Return true if sincos libcall is available.
 static bool isSinCosLibcallAvailable(SDNode *Node, const TargetLowering &TLI) {
-  RTLIB::Libcall LC;
-  switch (Node->getSimpleValueType(0).SimpleTy) {
-  default: llvm_unreachable("Unexpected request for libcall!");
-  case MVT::f32:     LC = RTLIB::SINCOS_F32; break;
-  case MVT::f64:     LC = RTLIB::SINCOS_F64; break;
-  case MVT::f80:     LC = RTLIB::SINCOS_F80; break;
-  case MVT::f128:    LC = RTLIB::SINCOS_F128; break;
-  case MVT::ppcf128: LC = RTLIB::SINCOS_PPCF128; break;
-  }
+  RTLIB::Libcall LC = RTLIB::getFSINCOS(Node->getSimpleValueType(0).SimpleTy);
   return TLI.getLibcallName(LC) != nullptr;
 }
 
@@ -2355,68 +2347,73 @@ static bool useSinCos(SDNode *Node) {
 }
 
 /// Issue libcalls to sincos to compute sin / cos pairs.
-void
-SelectionDAGLegalize::ExpandSinCosLibCall(SDNode *Node,
-                                          SmallVectorImpl<SDValue> &Results) {
-  RTLIB::Libcall LC;
-  switch (Node->getSimpleValueType(0).SimpleTy) {
-  default: llvm_unreachable("Unexpected request for libcall!");
-  case MVT::f32:     LC = RTLIB::SINCOS_F32; break;
-  case MVT::f64:     LC = RTLIB::SINCOS_F64; break;
-  case MVT::f80:     LC = RTLIB::SINCOS_F80; break;
-  case MVT::f128:    LC = RTLIB::SINCOS_F128; break;
-  case MVT::ppcf128: LC = RTLIB::SINCOS_PPCF128; break;
-  }
-
-  // The input chain to this libcall is the entry node of the function.
-  // Legalizing the call will automatically add the previous call to the
-  // dependence.
-  SDValue InChain = DAG.getEntryNode();
-
+void SelectionDAGLegalize::ExpandSinCosLibCall(
+    SDNode *Node, SmallVectorImpl<SDValue> &Results) {
   EVT RetVT = Node->getValueType(0);
   Type *RetTy = RetVT.getTypeForEVT(*DAG.getContext());
 
   TargetLowering::ArgListTy Args;
-  TargetLowering::ArgListEntry Entry;
+  TargetLowering::ArgListEntry Entry{};
+
+  // Find users of the node that store the results. The destination pointers
+  // can be used instead of creating stack allocations.
+  StoreSDNode *SinST = nullptr;
+  StoreSDNode *CosST = nullptr;
+  for (SDNode::use_iterator UI = Node->use_begin(), UE = Node->use_end();
+       UI != UE; ++UI) {
+    SDUse &Use = UI.getUse();
+    SDNode *User = Use.getUser();
+    if (!ISD::isNormalStore(User))
+      continue;
+    auto *ST = cast<StoreSDNode>(User);
+    if (Use.getResNo() == 0)
+      SinST = ST;
+    if (Use.getResNo() == 1)
+      CosST = ST;
+  }
 
   // Pass the argument.
   Entry.Node = Node->getOperand(0);
   Entry.Ty = RetTy;
-  Entry.IsSExt = false;
-  Entry.IsZExt = false;
   Args.push_back(Entry);
 
+  auto GetOrCreateOutPointer = [&](StoreSDNode *MaybeStore) {
+    if (MaybeStore)
+      return std::make_pair(MaybeStore->getBasePtr(),
+                            MaybeStore->getPointerInfo());
+    SDValue StackSlot = DAG.CreateStackTemporary(RetVT);
+    auto PtrInfo = MachinePointerInfo::getFixedStack(
+        DAG.getMachineFunction(),
+        cast<FrameIndexSDNode>(StackSlot)->getIndex());
+    return std::make_pair(StackSlot, PtrInfo);
+  };
+
   // Pass the return address of sin.
-  SDValue SinPtr = DAG.CreateStackTemporary(RetVT);
-  Entry.Node = SinPtr;
+  auto SinPtr = GetOrCreateOutPointer(SinST);
+  Entry.Node = SinPtr.first;
   Entry.Ty = PointerType::getUnqual(RetTy->getContext());
-  Entry.IsSExt = false;
-  Entry.IsZExt = false;
   Args.push_back(Entry);
 
   // Also pass the return address of the cos.
-  SDValue CosPtr = DAG.CreateStackTemporary(RetVT);
-  Entry.Node = CosPtr;
+  auto CosPtr = GetOrCreateOutPointer(CosST);
+  Entry.Node = CosPtr.first;
   Entry.Ty = PointerType::getUnqual(RetTy->getContext());
-  Entry.IsSExt = false;
-  Entry.IsZExt = false;
   Args.push_back(Entry);
 
-  SDValue Callee = DAG.getExternalSymbol(TLI.getLibcallName(LC),
-                                         TLI.getPointerTy(DAG.getDataLayout()));
-
-  SDLoc dl(Node);
-  TargetLowering::CallLoweringInfo CLI(DAG);
-  CLI.setDebugLoc(dl).setChain(InChain).setLibCallee(
-      TLI.getLibcallCallingConv(LC), Type::getVoidTy(*DAG.getContext()), Callee,
-      std::move(Args));
+  RTLIB::Libcall LC = RTLIB::getFSINCOS(RetVT);
+  auto [Call, Chain] = ExpandLibCall(LC, Node, std::move(Args), false);
 
-  std::pair<SDValue, SDValue> CallInfo = TLI.LowerCallTo(CLI);
+  // Replace explict stores with the library call.
+  for (StoreSDNode *ST : {SinST, CosST}) {
+    if (ST)
+      DAG.ReplaceAllUsesOfValueWith(SDValue(ST, 0), Chain);
+  }
 
-  Results.push_back(
-      DAG.getLoad(RetVT, dl, CallInfo.second, SinPtr, MachinePointerInfo()));
-  Results.push_back(
-      DAG.getLoad(RetVT, dl, CallInfo.second, CosPtr, MachinePointerInfo()));
+  SDLoc DL(Node);
+  for (auto [Ptr, PtrInfo] : {SinPtr, CosPtr}) {
+    SDValue LoadExp = DAG.getLoad(RetVT, DL, Chain, Ptr, PtrInfo);
+    Results.push_back(LoadExp);
+  }
 }
 
 SDValue SelectionDAGLegalize::expandLdexp(SDNode *Node) const {
diff --git a/llvm/lib/CodeGen/TargetLoweringBase.cpp b/llvm/lib/CodeGen/TargetLoweringBase.cpp
index eb3190c7cd247a..8e5a0a0ca06082 100644
--- a/llvm/lib/CodeGen/TargetLoweringBase.cpp
+++ b/llvm/lib/CodeGen/TargetLoweringBase.cpp
@@ -398,6 +398,11 @@ RTLIB::Libcall RTLIB::getFREXP(EVT RetVT) {
                       FREXP_PPCF128);
 }
 
+RTLIB::Libcall RTLIB::getFSINCOS(EVT RetVT) {
+  return getFPLibCall(RetVT, SINCOS_F32, SINCOS_F64, SINCOS_F80, SINCOS_F128,
+                      SINCOS_PPCF128);
+}
+
 RTLIB::Libcall RTLIB::getOutlineAtomicHelper(const Libcall (&LC)[5][4],
                                              AtomicOrdering Order,
                                              uint64_t MemSize) {
diff --git a/llvm/test/CodeGen/AArch64/sincos-stack-slots.ll b/llvm/test/CodeGen/AArch64/sincos-stack-slots.ll
new file mode 100644
index 00000000000000..afd054a83a5014
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sincos-stack-slots.ll
@@ -0,0 +1,115 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu -o - %s | FileCheck %s
+
+define { float, float } @sincos_f32_value_return(float %x) {
+; CHECK-LABEL: sincos_f32_value_return:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset w30, -16
+; CHECK-NEXT:    add x0, sp, #12
+; CHECK-NEXT:    add x1, sp, #8
+; CHECK-NEXT:    bl sincosf
+; CHECK-NEXT:    ldp s1, s0, [sp, #8]
+; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+entry:
+  %sin = tail call float @llvm.sin.f32(float %x)
+  %cos = tail call float @llvm.cos.f32(float %x)
+  %ret_0 = insertvalue { float, float } poison, float %sin, 0
+  %ret_1 = insertvalue { float, float } %ret_0, float %cos, 1
+  ret { float, float } %ret_1
+}
+
+define void @sincos_f32_ptr_return(float %x, ptr %out_sin, ptr %out_cos) {
+; CHECK-LABEL: sincos_f32_ptr_return:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset w30, -16
+; CHECK-NEXT:    bl sincosf
+; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+entry:
+  %sin = tail call float @llvm.sin.f32(float %x)
+  %cos = tail call float @llvm.cos.f32(float %x)
+  store float %sin, ptr %out_sin, align 4
+  store float %cos, ptr %out_cos, align 4
+  ret void
+}
+
+define float @sincos_f32_mixed_return(float %x, ptr %out_sin) {
+; CHECK-LABEL: sincos_f32_mixed_return:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset w30, -16
+; CHECK-NEXT:    add x1, sp, #12
+; CHECK-NEXT:    bl sincosf
+; CHECK-NEXT:    ldr s0, [sp, #12]
+; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+entry:
+  %sin = tail call float @llvm.sin.f32(float %x)
+  %cos = tail call float @llvm.cos.f32(float %x)
+  store float %sin, ptr %out_sin, align 4
+  ret float %cos
+}
+
+define { double, double } @sincos_f64_value_return(double %x) {
+; CHECK-LABEL: sincos_f64_value_return:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    sub sp, sp, #32
+; CHECK-NEXT:    str x30, [sp, #16] // 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 32
+; CHECK-NEXT:    .cfi_offset w30, -16
+; CHECK-NEXT:    add x0, sp, #24
+; CHECK-NEXT:    add x1, sp, #8
+; CHECK-NEXT:    bl sincos
+; CHECK-NEXT:    ldr d0, [sp, #24]
+; CHECK-NEXT:    ldr d1, [sp, #8]
+; CHECK-NEXT:    ldr x30, [sp, #16] // 8-byte Folded Reload
+; CHECK-NEXT:    add sp, sp, #32
+; CHECK-NEXT:    ret
+entry:
+  %sin = tail call double @llvm.sin.f64(double %x)
+  %cos = tail call double @llvm.cos.f64(double %x)
+  %ret_0 = insertvalue { double, double } poison, double %sin, 0
+  %ret_1 = insertvalue { double, double } %ret_0, double %cos, 1
+  ret { double, double } %ret_1
+}
+
+define void @sincos_f64_ptr_return(double %x, ptr %out_sin, ptr %out_cos) {
+; CHECK-LABEL: sincos_f64_ptr_return:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset w30, -16
+; CHECK-NEXT:    bl sincos
+; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+entry:
+  %sin = tail call double @llvm.sin.f64(double %x)
+  %cos = tail call double @llvm.cos.f64(double %x)
+  store double %sin, ptr %out_sin, align 4
+  store double %cos, ptr %out_cos, align 4
+  ret void
+}
+
+define double @sincos_f64_mixed_return(double %x, ptr %out_sin) {
+; CHECK-LABEL: sincos_f64_mixed_return:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset w30, -16
+; CHECK-NEXT:    add x1, sp, #8
+; CHECK-NEXT:    bl sincos
+; CHECK-NEXT:    ldr d0, [sp, #8]
+; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+entry:
+  %sin = tail call double @llvm.sin.f64(double %x)
+  %cos = tail call double @llvm.cos.f64(double %x)
+  store double %sin, ptr %out_sin, align 4
+  ret double %cos
+}

MacDue · 2024-09-12T14:44:53Z

Note: See the second commit for the diff between the precommit and the new changes (diff here).

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

llvm/test/CodeGen/AArch64/sincos-stack-slots.ll

RKSimon · 2024-09-12T16:02:02Z

Do we have anything else that could use this? modf comes to mind, but I don't think we have RTLIB support for that.

MacDue · 2024-09-12T16:17:27Z

Do we have anything else that could use this? modf comes to mind, but I don't think we have RTLIB support for that.

FFREXP has the same issue due to materializing a stack slot for the i32 result: https://godbolt.org/z/3EK3EKMWr

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

github-actions · 2024-09-17T13:13:00Z

✅ With the latest revision this PR passed the C/C++ code formatter.

arsenm

I think you can drop all of the chain logic and just use DAG.getEntryNode. FSINCOS is derived from no-memory/errno operations, and thus assumed to not care about the state of errno. You don't need all of this code trying to figure out the suitable chain

MacDue · 2024-09-20T16:45:30Z

I think you can drop all of the chain logic and just use DAG.getEntryNode. FSINCOS is derived from no-memory/errno operations, and thus assumed to not care about the state of errno. You don't need all of this code trying to figure out the suitable chain

I don't think that's safe with the stores. Given this IR (which contains a sincos intrinsic that lowers to FSINCOS):

define void @test(float %v, ptr %x, ptr %y) {
  call void @foo(ptr %x, ptr %y)

  %sret = call { float, float } @llvm.sincos.f32(float %v)
  %ret0 = extractvalue { float, float } %sret, 0
  %ret1 = extractvalue { float, float } %sret, 1
  store float %ret0, ptr %x, align 4
  store float %ret1, ptr %y, align 4

  ret void
}

declare void @foo(ptr, ptr)

The stores for the results of FSINCOS take the chain from the call to @foo as the input. With the current logic, this lowers correctly (a call to foo followed by a call to sincosf). If I instead make the input chain DAG.getEntryNode() the call to foo is lost.

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

When lowering `FSINCOS` to a library call (that takes output pointers) we can avoid creating new stack allocations if the results of the `FSINCOS` are being stored. Instead, we can take the destination pointers from the stores and pass those to the library call.

zmodem · 2024-11-07T14:35:17Z

We found an interesting issue after this change: #115323

MacDue · 2024-11-07T14:56:32Z

We found an interesting issue after this change: #115323

Thanks, I'll take a look 👍

wjristow · 2025-05-19T03:29:29Z

Hi @MacDue ,

We've run into a run-time failure exposed by this change. I've just created 140491 for it. Could you have a look, please. Thanks!

MacDue · 2025-05-19T10:36:36Z

Hi @MacDue ,

We've run into a run-time failure exposed by this change. I've just created 140491 for it. Could you have a look, please. Thanks!

I believe #140525 should fix this issue. Thanks for report 👍

MacDue requested review from SamTebbs33, huntergr-arm, sdesmalen-arm, kmclaughlin-arm and NickGuy-Arm September 12, 2024 14:42

llvmbot added backend:AArch64 llvm:SelectionDAG SelectionDAGISel as well labels Sep 12, 2024

MacDue commented Sep 12, 2024

View reviewed changes

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp Outdated Show resolved Hide resolved

arsenm reviewed Sep 12, 2024

View reviewed changes

MacDue force-pushed the sincos_stack_slots branch 2 times, most recently from 9855e54 to f46d132 Compare September 12, 2024 18:49

arsenm reviewed Sep 13, 2024

View reviewed changes

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp Outdated Show resolved Hide resolved

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp Outdated Show resolved Hide resolved

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp Outdated Show resolved Hide resolved

MacDue force-pushed the sincos_stack_slots branch from 596fc79 to 74113f0 Compare September 16, 2024 10:47

RKSimon reviewed Sep 17, 2024

View reviewed changes

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp Outdated Show resolved Hide resolved

MacDue force-pushed the sincos_stack_slots branch 2 times, most recently from 5e63a71 to f01dbd1 Compare September 17, 2024 14:56

arsenm reviewed Sep 20, 2024

View reviewed changes

arsenm reviewed Sep 21, 2024

View reviewed changes

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp Outdated Show resolved Hide resolved

MacDue added 6 commits September 23, 2024 13:46

Precommit test for sincos stack slots

7e706b4

Fixups

988273f

Avoid cyclic in chains

529a94e

Fixups

235b629

Use for-range loop

bf8f0d9

MacDue force-pushed the sincos_stack_slots branch from f01dbd1 to 052fddd Compare September 23, 2024 13:47

Fixups

e15c4e3

MacDue force-pushed the sincos_stack_slots branch from 052fddd to e15c4e3 Compare September 23, 2024 13:52

arsenm approved these changes Sep 24, 2024

View reviewed changes

MacDue merged commit 3073c3c into llvm:main Sep 24, 2024
8 checks passed

MacDue deleted the sincos_stack_slots branch September 24, 2024 12:36

[SDAG] Avoid creating redundant stack slots when lowering FSINCOS #108401

[SDAG] Avoid creating redundant stack slots when lowering FSINCOS #108401

Uh oh!

Conversation

MacDue commented Sep 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Sep 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MacDue commented Sep 12, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

RKSimon commented Sep 12, 2024

Uh oh!

MacDue commented Sep 12, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Sep 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arsenm left a comment

Choose a reason for hiding this comment

Uh oh!

MacDue commented Sep 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zmodem commented Nov 7, 2024

Uh oh!

MacDue commented Nov 7, 2024

Uh oh!

wjristow commented May 19, 2025

Uh oh!

MacDue commented May 19, 2025

Uh oh!

Uh oh!

MacDue commented Sep 12, 2024 •

edited

Loading

llvmbot commented Sep 12, 2024 •

edited

Loading

github-actions bot commented Sep 17, 2024 •

edited

Loading

MacDue commented Sep 20, 2024 •

edited

Loading